ABSTRACT

Background Epistasis has been historically used to 
describe the phenomenon that the effect of a given 
gene on a phenotype can be dependent on one or more 
other genes, and is an essential element for 
understanding the association between genetic and 
phenotypic variations. Quantifying epistasis of orders 
higher than two is very challenging due to both the 
computational complexity of enumerating all possible 
combinations in genome-wide data and the lack of 
efficient and effective methodologies. 
Objectives In this study, we propose a fast, 
non-parametric, and model-free measure for three-way 
epistasis. 

Methods Such a measure is based on information 
gain, and is able to separate all lower order effects from 
pure three-way epistasis. 

Results Our method was verified on synthetic data and 
applied to real data from a candidate-gene study of 
tuberculosis in a West African population. In the 
tuberculosis data, we found a statistically significant pure 
three-way epistatic interaction effect that was stronger 
than any lower-order associations. 
Conclusion Our study provides a methodological basis 
for detecting and characterizing high-order gene-gene 
interactions in genetic association studies. 



INTRODUCTION 

Understanding the mapping from genetic variation
to phenotypic variation has great potential to
help us understand, predict, diagnose, and treat
common human diseases. However, existing
main-effect-centered methodologies and techniques 
that depend on fundamental assumptions about a 
simple genetic architecture can only find very 
limited individual associations with disease risks. 
Genome-wide association studies 1-3 and next 
generation sequencing 4 make millions of single 
nucleotide polymorphisms (SNPs) in the human 
genome available for testing associations with 
phenotypic traits. These developments call for new 
methodologies that embrace the complex genetic 
architecture of diseases. 5 6 The non-additive effects 
of gene-gene interactions, that is, epistasis, are 
believed to be an important contributor to the 
complex relationship between genetic and pheno- 
typic variations. 7-11 The focus of recent disease 
association research is shifting from identifying 
single locus susceptibility to quantifying interaction 
effects between multiple candidate loci throughout 
the human genome. 8 9 12 



Detecting and quantifying epistasis is a very chal- 
lenging task. First, the epistatic interactions could 
involve multiple genes from a pair to a large set, 
and this undetermined order of interactions 
imposes enormous computational complexity of 
enumerating all possible combinations of genetic 
attributes for varying orders. 6 13 Second, as the 
order of interacting genetic attributes goes beyond 
two, it becomes mathematically difficult to separate 
the additive lower-order effects and the pure 
higher-order synergy, that is, the extreme case in 
which the association can only be observed when 
all attributes are considered together. Those are 
also the major reasons why most existing epistasis 
studies are limited to pairwise interactions on gen- 
etics data of moderate sizes. 

Information-theoretic measures have emerged as a 
very useful tool to quantify synergistic interactions 
among multiple genetic attributes. 14-21 These mea- 
sures are based on the Shannon entropy, which 
quantifies the amount of information, or uncertainty, 
of a random variable. 22 By considering genetic attri- 
butes and phenotypic traits as random variables, 
entropy-based information-theoretic measures can 
be used to quantify the shared information between 
one gene and a trait, i.e., the main effect, as well as 
the gained extra information about a trait obtained 
from combining multiple genes, i.e., the synergistic 
effect or epistasis. However, as discussed previously, 
due to the mathematical complexity, the application 
of information-theoretic measures in disease associ- 
ation studies is mainly limited to pairwise epistasis 
between two genetic attributes. 

In this study, we propose a new measure to quan- 
tify the synergistic effects among three genetic attri- 
butes that contribute to disease susceptibility by 
extending the information-theoretic measure for 
pairwise synergy. In particular, we first measure the 
total amount of information that three attributes 
together can provide about the phenotypic status, 
and then subtract all lower-order effects including 
the main effects of the three attributes and all pair- 
wise synergies between them. This yields a very 
strict measure of pure three-way epistasis. There 
have been previous attempts at extending 
information-theoretic measures on three-way and 
higher-order synergies. 14 17 23 24 However, most 
existing measures are not able to decouple lower- 
order interaction effects from the higher-order 
effects and the formalization of higher-order 
synergy is still debatable. We compared our new
measure with existing ones on both synthetic data
containing artificial epistasis models and real human
disease data from a candidate-gene association study of
tuberculosis in a population from West Africa, 25 and
showed that it performed best at separating all
lower-order effects from pure three-way epistasis. Of
particular interest, we identified a statistically
significant three-way epistasis model in the tuberculosis
data, in which the three-way synergy is stronger than all
the main effects and pairwise-interaction effects
combined. With further biological verification and
interpretation, this model could be very valuable in
advancing tuberculosis research.

METHODS 

Information-theoretic measures 

In information theory, 22 entropy is a measure of the uncertainty
of a random variable. It can be explained as the amount of
information required on average to describe a random variable.
For a discrete variable X with alphabet $\mathcal{X}$ and probability
mass function p(x), its entropy H(X) is defined as

$$H(X) = -\sum_{x \in \mathcal{X}} p(x) \log p(x).$$

When there is more than one random variable, the definition of
entropy can be extended as follows. The joint entropy of two
discrete random variables X and Y with a joint distribution
p(x,y) is defined as

$$H(X,Y) = -\sum_{x} \sum_{y} p(x,y) \log p(x,y),$$

and the conditional entropy of X given the knowledge of Y can
be obtained by the chain rule as

$$H(X \mid Y) = H(X,Y) - H(Y).$$

The dependency between two random variables can be
described using mutual information. 22 This is a measure of the
amount of information that one random variable contains about
the other, or can be thought of as the reduction of uncertainty
of one random variable given the knowledge of the other. In
the context of genetic association studies, mutual information
can be very useful to quantify how much of a phenotypic status
is explained by genotypic variations. We consider a genetic
attribute $G_1$ and the phenotypic class C (for example, case or
control), both of which are discrete random variables. The mutual
information $I(G_1;C)$ measures the reduction in the uncertainty of
the class C due to the knowledge about the genotype of $G_1$
(figure 1A), defined as

$$I(G_1;C) = H(C) - H(C \mid G_1).$$

Intuitively, $I(G_1;C)$ can be used as a measure of the main effect
of the genetic attribute $G_1$ on the class C.

Mutual information can also be extended to measure the
epistatic interaction effect between two attributes. Given two
genetic attributes $G_1$ and $G_2$, the mutual information

$$I(G_1,G_2;C) = H(C) - H(C \mid G_1,G_2)$$

measures how much of the phenotypic class $G_1$ and $G_2$ together
can explain (figure 1B). Subtracting the individual main
effects of $G_1$ and $G_2$ from their joint effect $I(G_1,G_2;C)$ gives

$$IG(G_1;G_2;C) = I(G_1,G_2;C) - I(G_1;C) - I(G_2;C).$$

The information gain $IG(G_1;G_2;C)$ is the gain in mutual
information from knowing both $G_1$ and $G_2$ with respect to the
class C. A positive value of $IG(G_1;G_2;C)$ indicates synergy
between $G_1$ and $G_2$, while a negative value indicates redundancy
or correlation between them. The synergy can be well explained by
the epistatic interactions between two genetic attributes. As
discussed previously, this pairwise information-gain measure has
been successfully applied in many epistasis studies thanks to its
model-free, non-parametric, and fast implementation.

Further extension of the information-gain measure to more
than two genetic attributes is non-trivial for epistasis studies
because many complex human diseases could very likely involve
genetic interactions of orders higher than two. There is no
widely accepted formal definition of information gain for more
than two genetic attributes. Here, we attempt to measure
three-way synergistic interactions using information gain.

In a previous attempt, Anastassiou 14 and Varadan et al 24
proposed to define the three-way information gain by comparing
the joint mutual information of the whole set with the
best-achieved mutual information of its subsets after breaking
the whole into partitions, mathematically written as

$$IG_{partition}(G_1;G_2;G_3;C) = I(G_1,G_2,G_3;C) - \max \left\{ \begin{array}{l} I(G_1,G_2;C) + I(G_3;C) \\ I(G_1,G_3;C) + I(G_2;C) \\ I(G_2,G_3;C) + I(G_1;C) \\ I(G_1;C) + I(G_2;C) + I(G_3;C) \end{array} \right\}.$$

The partition of the set $\{G_1,G_2,G_3\}$ chosen in this formula
is the one that maximizes the sum of the amounts of mutual
information connecting the subsets with the phenotypic class.
This is referred to as the 'maximum-information partition' of the
set $\{G_1,G_2,G_3\}$ with respect to the class C. This three-way
$IG_{partition}(G_1;G_2;G_3;C)$ quantifies the information that can be
gained by combining $G_1$, $G_2$, and $G_3$ together compared with its
maximum-information partition. Although technically sound,
this formula might include false-positive errors of pure
three-way epistasis. For instance, assuming that $I(G_1,G_2;C) + I(G_3;C)$
corresponds to the maximum-information partition, after combining
$G_3$ with $\{G_1,G_2\}$, the gained information could be the result of
either pure three-way epistasis, or pairwise epistasis between
$G_3$ and one (or both) of $\{G_1,G_2\}$, or a mixture of all of the above.
A stricter alternative measure 23 was proposed as follows

$$IG_{alternative}(G_1;G_2;G_3;C) = I(G_1,G_2,G_3;C) - IG(G_1;G_2;C) - IG(G_1;G_3;C) - IG(G_2;G_3;C) - I(G_1;C) - I(G_2;C) - I(G_3;C),$$

where all the lower-order effects are subtracted. However, as
reviewed and pointed out in Anastassiou, 14 this formula fails in the
extreme redundancy case where $G_1$, $G_2$, and $G_3$ all provide the
same full amount of information on C, that is,
$G_1 = G_2 = G_3 = C$. In this case,
$I(G_i;C) = I(G_i,G_j;C) = I(G_i,G_j,G_k;C) = H(C)$, where i, j, k are
different values taken from $\{1,2,3\}$, so each pairwise information
gain equals $-H(C)$. Therefore
$IG_{alternative}(G_1;G_2;G_3;C) = H(C) + 3H(C) - 3H(C) = H(C)$, which
contradictorily indicates extreme synergy.
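The quantities defined above can be estimated from genotype and class columns by plugging observed frequencies into the entropy formulas. The following is a minimal sketch under that assumption, not the implementation used in the study; the function names and the toy data are hypothetical. It computes the pairwise information gain for an XOR-like toy model (synergy) and for fully redundant attributes, and reproduces the failure of $IG_{alternative}$ in the extreme redundancy case.

```python
from collections import Counter
from math import log2


def entropy(*columns):
    """Joint Shannon entropy (bits) of one or more equal-length discrete columns."""
    rows = list(zip(*columns))
    n = len(rows)
    return -sum((k / n) * log2(k / n) for k in Counter(rows).values())


def mutual_information(genes, c):
    """I(G_1,...,G_k; C) = H(C) - H(C|G) = H(C) + H(G) - H(G, C)."""
    return entropy(c) + entropy(*genes) - entropy(*genes, c)


def pairwise_ig(g1, g2, c):
    """IG(G1;G2;C) = I(G1,G2;C) - I(G1;C) - I(G2;C)."""
    return (mutual_information([g1, g2], c)
            - mutual_information([g1], c)
            - mutual_information([g2], c))


def ig_alternative(g1, g2, g3, c):
    """Three-way measure that subtracts all pairwise gains, including negative ones."""
    pairs = pairwise_ig(g1, g2, c) + pairwise_ig(g1, g3, c) + pairwise_ig(g2, g3, c)
    mains = sum(mutual_information([g], c) for g in (g1, g2, g3))
    return mutual_information([g1, g2, g3], c) - pairs - mains


# Hypothetical toy data: the class is the XOR of two binary attributes.
xor_g1, xor_g2 = [0, 0, 1, 1], [0, 1, 0, 1]
xor_c = [a ^ b for a, b in zip(xor_g1, xor_g2)]
print(pairwise_ig(xor_g1, xor_g2, xor_c))   # positive: synergy

# Extreme redundancy: every attribute is identical to the class.
dup = [0, 1, 0, 1]
print(pairwise_ig(dup, dup, dup))           # negative: redundancy
print(ig_alternative(dup, dup, dup, dup))   # equals H(C) despite pure redundancy
```

Entropies are in bits because the logarithm is taken to base 2; any other base only rescales the values.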
Figure 1 Venn diagrams showing the entropy and mutual
information of genetic attributes G and the phenotypic
class C. (A) For one attribute $G_1$ and the phenotypic class
C, their entropies $H(G_1)$ and H(C) are indicated as the two
colored sets. The mutual information $I(G_1;C)$ is defined
as the intersection of the two sets, and the joint entropy
$H(G_1,C)$ is the union of the two sets. (B) For two genetic
attributes $G_1$, $G_2$ and the phenotypic class C, the mutual
information $I(G_1,G_2;C)$ is the intersection of entropy
H(C) and joint entropy $H(G_1,G_2)$.




In our study, we propose a new strict measure by modifying
$IG_{alternative}$ as

$$IG_{strict}(G_1;G_2;G_3;C) = I(G_1,G_2,G_3;C) - \max\{IG(G_1;G_2;C), 0\} - \max\{IG(G_1;G_3;C), 0\} - \max\{IG(G_2;G_3;C), 0\} - I(G_1;C) - I(G_2;C) - I(G_3;C).$$

We only subtract pairwise synergies, that is, positive
information gain, because the failure of $IG_{alternative}$ is due to
the fact that it adds back information by subtracting negative
information gain. By subtracting all lower-order effects and
synergies, $IG_{strict}$ measures the pure three-way synergy that is
observable only by considering three attributes together.
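The difference between the two three-way measures can be illustrated on small toy models. The sketch below is an assumption-laden illustration rather than the study's code: the function names are hypothetical, probabilities are plug-in frequency estimates, and the second toy model uses a hypothetical four-valued class built from two pairwise XOR signals so that only pairwise interactions are present.

```python
from collections import Counter
from itertools import product
from math import log2


def entropy(*columns):
    """Joint Shannon entropy (bits) of one or more equal-length discrete columns."""
    rows = list(zip(*columns))
    n = len(rows)
    return -sum((k / n) * log2(k / n) for k in Counter(rows).values())


def mi(genes, c):
    """I(G_1,...,G_k; C) = H(C) + H(G) - H(G, C)."""
    return entropy(c) + entropy(*genes) - entropy(*genes, c)


def ig(g1, g2, c):
    """Pairwise information gain IG(G1;G2;C)."""
    return mi([g1, g2], c) - mi([g1], c) - mi([g2], c)


def ig_partition(g1, g2, g3, c):
    """Joint information minus the maximum-information partition."""
    m1, m2, m3 = mi([g1], c), mi([g2], c), mi([g3], c)
    best = max(mi([g1, g2], c) + m3,
               mi([g1, g3], c) + m2,
               mi([g2, g3], c) + m1,
               m1 + m2 + m3)
    return mi([g1, g2, g3], c) - best


def ig_strict(g1, g2, g3, c):
    """Joint information minus main effects and positive pairwise gains only."""
    pairs = sum(max(ig(a, b, c), 0.0) for a, b in ((g1, g2), (g1, g3), (g2, g3)))
    mains = mi([g1], c) + mi([g2], c) + mi([g3], c)
    return mi([g1, g2, g3], c) - pairs - mains


# All eight combinations of three binary attributes, each equally frequent.
rows = list(product([0, 1], repeat=3))
g1, g2, g3 = (list(col) for col in zip(*rows))

# Pure three-way toy model: class is the parity of all three attributes.
parity_c = [a ^ b ^ d for a, b, d in rows]
# Pairwise-only toy model: class label is the pair (g1 XOR g2, g2 XOR g3).
pairs_c = [(a ^ b, b ^ d) for a, b, d in rows]

print(ig_partition(g1, g2, g3, parity_c), ig_strict(g1, g2, g3, parity_c))
print(ig_partition(g1, g2, g3, pairs_c), ig_strict(g1, g2, g3, pairs_c))
```

In the parity model both measures are positive and equal. In the pairwise-only model the partition-based measure stays positive while the strict measure does not, which mirrors the behavior described for models with several pairwise synergies.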

Also note that, when applied to genetics data, the above
mutual-information and information-gain measures can be
normalized by dividing by the class entropy H(C). The normalized
mutual information lies in [0, 1] and the normalized pairwise
information gain in [-1, 1]; they give the proportion of the
phenotypic class C that is explained by the knowledge of
one or multiple genetic attributes.

Datasets 

We applied both three-way synergy measures, $IG_{partition}$ and
$IG_{strict}$, to synthetic datasets and a real dataset on pulmonary
tuberculosis from a West African population. The synthetic
datasets were generated using the genetic architecture model
emulator for testing and evaluating software (GAMETES), 26 27 a
direct approach to simulating bi-allelic n-locus epistatic models. In
GAMETES, an n-locus epistasis model is generated determinis- 
tically with specified genetic constraints such as heritability and 
minor allele frequencies, and then a population of samples can 
be generated for that model. Using GAMETES, we can have 
synthetic data of models with epistasis at desired orders, which 
are ideal for testing and comparing the two three-way synergy 
measures. We first generated a pure three-way epistasis model 
$\{P_1, P_2, P_3\}$, where the association with phenotypic status was only
observable when all three SNPs were considered together, that 
is, no main effects and no pairwise interactions. The total herit- 
ability of combining three SNPs was set to 0.27, and the minor 
allele frequencies of all three SNPs were set to 0.2. A corre- 
sponding dataset was generated by including 100 SNPs in total 
with 97 randomized SNPs, $\{N_1, N_2, N_3, \ldots, N_{97}\}$, to provide a
null distribution. This synthetic dataset had 400 cases and 400 
controls. Then we used GAMETES to generate a collective
pairwise epistasis model $\{P'_1, P'_2, P'_3\}$, in which any two of the
three attributes had strong pairwise interactions but there was
no epistasis at the three-way level. Again there was no main
effect for each attribute. The heritability of combining three
SNPs was set to 0.32, and the heritability of combining any two
SNPs, that is, $\{P'_1, P'_2\}$, $\{P'_1, P'_3\}$, and $\{P'_2, P'_3\}$, was
about 0.1. The minor allele frequencies for all three SNPs were
again set to 0.2. The second synthetic dataset also included 100
SNPs in total with 97 randomized SNPs and had 400 cases and 400
controls.

Figure 2 Information-theoretic measures (normalized information
gain and mutual information representing the genetic
associations to the case-control status) of the synthetic
datasets generated by GAMETES. (A) The first synthetic dataset
that has a pure three-way epistasis model $\{P_1, P_2, P_3\}$ with no
main effects and no pairwise epistasis. (B) The second synthetic
dataset with a collective pairwise epistasis model
$\{P'_1, P'_2, P'_3\}$, in which there are no main effects and no
three-way epistasis but all the pairs $\{P'_1, P'_2\}$, $\{P'_1, P'_3\}$,
and $\{P'_2, P'_3\}$ have interaction effects. Points in red
represent the observed values of the model $\{P_1, P_2, P_3\}$ in (A)
and $\{P'_1, P'_2, P'_3\}$ in (B). Box plots in black show the null
distributions of randomized single nucleotide polymorphisms.

The pulmonary tuberculosis data are from a case-control 
study 25 conducted at the Bandim Health Project in Bissau, the 
capital city of Guinea-Bissau. This area has a high prevalence of 
pulmonary tuberculosis and tuberculosis symptoms. 28 
Tuberculosis, caused by infection with Mycobacterium
tuberculosis, is among the diseases with the highest mortality.
However, the majority of infected individuals keep the bacterium
under control and never develop clinical disease. Genetic
variation among individuals is a promising direction for
investigating the factors that could influence susceptibility to
tuberculosis. Tuberculosis
patients included in this study were residents or long-term 
guests of Bissau aged 15 years or older. From November 2003 
to November 2005, 438 tuberculosis patients were screened at 
local health centers. A total of 344 subjects met the inclusion 
criteria and accepted participation, and DNA samples were suc- 
cessfully collected from 321 of them. Healthy controls were 
recruited from the study area with certain exclusion criteria, 
such as history of tuberculosis and household tuberculosis 
records within the past 2 years. Three hundred and forty-seven 
DNA samples were obtained from the healthy control group. 
All DNA samples were extracted using a standard salting-out 
procedure. DNA purities were estimated spectrophotometrically, 
and final concentrations were determined by PicoGreen. 
Samples (4 ng of DNA) were genotyped by TaqMan SNP assays 
(ABI; Applera International Inc, Foster City, California, USA) in 
10 µl reaction volume, using the Rotor-Gene 3000 (Corbett
Robotics Pty Ltd, Brisbane, Queensland, Australia) and the ABI
7500 real-time PCR systems. Fluorescence curves were analyzed
with the Rotor-Gene Software V6 and the 7500 Sequence
Detection Software V1.2.1 for allelic discrimination. The data
include 19 SNPs from innate immunity genes, DC-SIGN 
(CD209), long pentraxin 3 (PTX3), toll-like receptors (TLRs), 
and vitamin D receptor (VDR), which are relevant to the defense 
against M tuberculosis. The missing genotypes (<5%) were
imputed using a frequency-based method: the missing value of a
sample was filled with the most common genotype of the
corresponding SNP in the population.
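The frequency-based imputation described above amounts to replacing each missing entry of a SNP column with that column's most common observed genotype. A minimal sketch follows, assuming missing genotypes are encoded as `None` and genotypes as short strings; both the encoding and the function name are hypothetical choices for illustration.

```python
from collections import Counter


def impute_most_common(genotypes, missing=None):
    """Fill missing entries of one SNP column with the most common observed genotype."""
    observed = [g for g in genotypes if g != missing]
    if not observed:
        # Nothing to learn from: return the column unchanged.
        return list(genotypes)
    mode = Counter(observed).most_common(1)[0][0]
    return [mode if g == missing else g for g in genotypes]


# Hypothetical SNP column with two missing calls.
snp = ["AA", "AG", None, "AA", "GG", None]
print(impute_most_common(snp))
```

Each SNP is imputed independently of the others and of the case-control status in this sketch.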

RESULTS 

Information-gain measurements on synthetic data 

Figure 2A shows the results of applying information-theoretic
measures to the first synthetic dataset that had a pure three-way
epistasis model. The points in red represent the observed values
of the model $\{P_1, P_2, P_3\}$, and the box plots in black show the
measures for the randomized SNPs $\{N_1, N_2, N_3, \ldots, N_{97}\}$.
Neither main effects nor pairwise synergies were found using the
information-theoretic measures, as the observed data points are
not distinguishable from the null distributions. In addition, both
three-way synergy measures, $IG_{partition}$ and $IG_{strict}$,
successfully captured the pure three-way epistasis in the model.

The results of the second synthetic dataset with the collective
pairwise epistasis model are shown in figure 2B. Again, the points
in red represent the observed values of the model $\{P'_1, P'_2, P'_3\}$,
and the box plots in black show the null distributions of 97
randomized SNPs. All three pairwise synergies were
detected. No three-way epistasis was detected by $IG_{strict}$, but
$IG_{partition}$ reported a strong three-way epistasis among $P'_1$, $P'_2$,
and $P'_3$ by including some portions of their pairwise synergies.

Information-gain measurements on real data 

We used information-theoretic measures to quantify the main
effects and the two-way and three-way synergies for all possible
combinations of attributes in the tuberculosis data. In addition, to
show the collective and neighborhood structures of strong
synergistic pairs, we built a pairwise epistasis network 18 (figure 3,
rendered with the software Cytoscape 29 ). The network was
constructed through an incremental process as follows. An edge
and its two end vertices were added to the network only if their
pairwise epistasis strength was greater than a given threshold.
When we gradually decreased the threshold from its maximal
observed value to its minimal value, the network started from a
single edge with two vertices and eventually became a complete
graph, that is, every single vertex is directly connected to every
other vertex. We chose the highest pairwise synergy threshold,
that is, 0.71%, at which all 19 attributes were included in the
network. Therefore, we can have a map of the strongest pairwise
interactions showing the neighborhood structure of every
attribute.

Figure 3 The pairwise statistical epistasis network of the
tuberculosis data. The network has 32 edges (pairwise
interactions) and 19 vertices (single nucleotide polymorphisms;
SNPs). The size of a vertex indicates its main effect, and edge
width indicates the pairwise synergy. The color of a vertex
denotes the gene that a SNP belongs to, with blue representing
CD209, green representing VDR, pink representing PTX3, and
yellow representing TLRs. The network was built by incrementally
adding pairwise interactions, ranked by their strength (pairwise
information gain), and their two end SNPs. This network
construction process completed when all 19 SNPs were included.
It thus shows the strongest pairwise interactions and the
neighborhood structures for all 19 SNPs.

Figure 3 has 19 vertices and 32 edges that represent the 32
strongest pairwise interactions out of all $\binom{19}{2} = 171$ pairs.
Using this network, we can easily identify three-way models that
have strong collective pairwise synergies. In graph theory, 30 the
distance $d(v_1, v_2)$ of a pair of vertices $v_1$ and $v_2$ is defined as the
minimal number of edges for one vertex to reach the other.
Given three vertices $v_1$, $v_2$, and $v_3$, we define their trio distance
$d_{trio}(v_1, v_2, v_3)$ as the sum of all pairwise distances, that is,
$d_{trio}(v_1, v_2, v_3) = d(v_1, v_2) + d(v_1, v_3) + d(v_2, v_3)$. Therefore, for
trios with $d_{trio} = 3$, any two of them are directly joined by an
edge, which indicates three strong collective pairwise
interactions in a three-way model. If a trio has $d_{trio} = 4$, one vertex is
directly connected to the other two but the other two are not
joined by an edge. A trio with $d_{trio} > 4$ does not have strong
collective pairwise epistasis.
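The network construction and the trio distance can be sketched as follows. This is an illustrative sketch, not the study's code: it assumes the pairwise information gains are available as a dictionary keyed by SNP pairs, adds edges one at a time in decreasing order of strength (so ties at the final threshold are not handled specially), and uses breadth-first search for shortest paths. All names and the toy values are hypothetical.

```python
from collections import deque


def build_network(pair_ig, n_vertices):
    """Add the strongest edges first and stop once every vertex is in the network."""
    adjacency = {}
    ranked = sorted(pair_ig.items(), key=lambda item: item[1], reverse=True)
    for (u, v), _strength in ranked:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
        if len(adjacency) == n_vertices:
            break
    return adjacency


def distance(adjacency, source, target):
    """Minimal number of edges between two vertices (breadth-first search)."""
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == target:
            return dist
        for neighbor in adjacency.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    return float("inf")  # unreachable


def trio_distance(adjacency, a, b, c):
    """Sum of the three pairwise distances within a trio."""
    return (distance(adjacency, a, b)
            + distance(adjacency, a, c)
            + distance(adjacency, b, c))


# Hypothetical pairwise information gains among four SNPs.
pair_ig = {("A", "B"): 0.9, ("B", "C"): 0.8, ("A", "C"): 0.7,
           ("C", "D"): 0.6, ("A", "D"): 0.1, ("B", "D"): 0.05}
network = build_network(pair_ig, n_vertices=4)
print(trio_distance(network, "A", "B", "C"))  # triangle
print(trio_distance(network, "A", "C", "D"))  # one vertex joined to the other two
print(trio_distance(network, "A", "B", "D"))  # no strong collective pairwise structure
```

In the toy network the three trios have distances 3, 4, and 5, matching the three cases distinguished in the text.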

Figure 4 shows the comparison of the results of both three-way
synergy measures. In general, $IG_{partition}$ and $IG_{strict}$ are positively
correlated (Spearman's rank correlation $\rho = 0.8448$,
$p < 2.2 \times 10^{-16}$). All the data points, that is, three-way models, are
positioned on one side of the x = y line, which indicates that the
$IG_{strict}$ measure is always less than or equal to the $IG_{partition}$
measure. This is intuitive because $IG_{strict}$ subtracts more terms than
$IG_{partition}$ does from $I(G_1, G_2, G_3; C)$. Colors indicate whether a
trio has strong collective pairwise epistasis. As seen in the figure,
the discrepancies (away from the x = y line) between $IG_{partition}$ and
$IG_{strict}$ are most pronounced for red data points, that is, trios
that have either two or three strong collective pairwise interactions.
These results also verify our previous discussion on these two
three-way synergy measures using synthetic data. That is, $IG_{partition}$
probably includes pairwise synergies in its three-way epistasis measure
when there is more than one pairwise synergy in a three-way
model, and $IG_{strict}$ only captures the pure three-way epistasis that
is beyond any lower-order synergies or effects.

Figure 4 Comparison of two three-way synergy measurements in the
tuberculosis data. Each data point represents a three-way model and its
color indicates whether the three-way model has strong collective
pairwise epistasis. If a trio's distance is less than or equal to 4 in the
pairwise epistasis network (figure 3), this three-way model has strong
collective pairwise interactions among its three attributes (in red).
Otherwise a trio does not possess strong collective pairwise epistasis
(in black). The dashed line represents the x = y line.

Table 1 Spearman's rank correlation of association synergies or
effects at different model orders in the tuberculosis dataset

                    Three-way IG_partition         Three-way IG_strict
Main effect         ρ = 0.0588, p = 0.0015         ρ = 0.0508, p = 0.0062
Pairwise synergy    ρ = 0.3278, p < 2.2×10^-16     ρ = 0.0565, p = 0.0023

We also looked into the correlations between the three-way
synergies and lower-order synergies or effects (table 1). Neither
three-way synergy measure shows an appreciable correlation with
individual main effects ($\rho < 0.06$). However, the three-way
$IG_{partition}$ shows a correlation with pairwise synergy
($\rho = 0.3278$) but $IG_{strict}$ does not ($\rho = 0.0565$). This further
confirms our previous discussion on the differences between
these two three-way synergy measures.

The best three-way models using those two synergy measures are
reported in figure 5. The figure shows all the individual main
effects, pairwise synergies, three-way synergies, and the total
mutual information for a three-way model. The best $IG_{partition}$
model (figure 5A) includes SNPs rs11465421 from CD209,
rs1544410 from VDR, and rs7975232 from VDR. It has a total
mutual information of 6.695% and a three-way $IG_{partition}$ of 3.668%.
This model is clearly a mixed epistasis model, possessing both strong
collective pairwise interactions and a pure three-way interaction.
The best $IG_{strict}$ model (figure 5B) involves SNPs from three
different genes, rs5743836 from TLR9, rs2305619 from PTX3, and
rs4804803 from CD209. All three SNPs have very limited main
effects and pairwise synergies. However, the strict three-way
information gain is stronger than all other lower-order effects
combined, and contributes $IG_{strict}/I = 54.23\%$ to the total
association. An explicit test of epistasis 31 was used to assess the
statistical significance of those observations. Instead of shuffling the
case-control class as in a standard permutation test, the explicit test
randomly shuffled genotypes of samples within each class. Therefore,
the genotype frequencies within each class remained fixed. This
preserved the independent main effects while randomizing any
non-linear interactions, and provided the null hypothesis that the
only genetic effects in the data were linear and additive. We
performed the explicit test 1000 times for each observation, and both
best models were statistically significant (p = 0.001 for the best
$IG_{partition}$ and p = 0.008 for the best $IG_{strict}$).
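The within-class shuffling described above can be sketched as a permutation procedure. The sketch below rests on several assumptions that the text does not spell out: each SNP column is shuffled independently of the others, the p value uses the common (hits + 1)/(permutations + 1) convention, and the test statistic shown is the pairwise information gain rather than the three-way measures. Function names and data are hypothetical.

```python
import random
from collections import Counter
from math import log2


def entropy(*columns):
    """Joint Shannon entropy (bits) of one or more equal-length discrete columns."""
    rows = list(zip(*columns))
    n = len(rows)
    return -sum((k / n) * log2(k / n) for k in Counter(rows).values())


def mi(genes, c):
    """I(G_1,...,G_k; C) = H(C) + H(G) - H(G, C)."""
    return entropy(c) + entropy(*genes) - entropy(*genes, c)


def pairwise_ig(snps, classes):
    """Example statistic: IG(G1;G2;C) for a list of two SNP columns."""
    g1, g2 = snps
    return mi([g1, g2], classes) - mi([g1], classes) - mi([g2], classes)


def shuffle_within_class(column, classes, rng):
    """Permute one SNP column only among samples sharing the same class label."""
    shuffled = list(column)
    for label in sorted(set(classes)):
        index = [i for i, c in enumerate(classes) if c == label]
        values = [column[i] for i in index]
        rng.shuffle(values)
        for i, value in zip(index, values):
            shuffled[i] = value
    return shuffled


def explicit_test(snps, classes, statistic, n_perm=1000, seed=0):
    """Return the observed statistic and a permutation p value (assumed convention)."""
    rng = random.Random(seed)
    observed = statistic(snps, classes)
    hits = 0
    for _ in range(n_perm):
        permuted = [shuffle_within_class(s, classes, rng) for s in snps]
        if statistic(permuted, classes) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)


# Hypothetical toy data: class is the XOR of two binary SNPs.
snp_a = [0, 1, 0, 1, 0, 1, 0, 1]
snp_b = [0, 0, 1, 1, 0, 0, 1, 1]
status = [a ^ b for a, b in zip(snp_a, snp_b)]
observed, p_value = explicit_test([snp_a, snp_b], status, pairwise_ig, n_perm=200)
print(observed, p_value)
```

Because every column is permuted only within a class, each SNP's genotype counts per class, and hence its main effect, are unchanged in every permuted dataset; only the joint structure among SNPs is randomized.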

DISCUSSION 

Epistasis is recognized as playing an important role in the 
genetic architecture of complex traits such as common human 
diseases. Quantifying interaction effects among multiple loci 
throughout the human genome has become the major focus of 
current research for understanding the complex relationship 
between genetic variations and phenotypic traits. 8 9 12 However,
this task is challenging for two reasons: first, enumerating all
possible combinations of genetic variants in a dataset of moderate
size is computationally infeasible, and second, it is difficult to
separate the additive effects and the synergistic effects among
multiple genetic factors. The majority of epistasis studies focus
on pairwise interactions and rarely look at interactions of
orders higher than two. It is computationally intensive to
enumerate three-way combinations in human genome data and,
furthermore, not many good methods of quantifying three-way
epistasis have been proposed in the literature. 14

Figure 5 The models with the highest three-way information gain in
the tuberculosis data. (A) The best model of $IG_{partition}$ with
permutation testing significance p = 0.001, and (B) the best model
of $IG_{strict}$ with permutation testing significance p = 0.008. Each
circle is a single nucleotide polymorphism (SNP) with its name and
main effect strength. An edge represents a pairwise epistasis with
its strength labeled. I is the total mutual information between the
combination of three SNPs and the case-control status, and IG is
the three-way information gain. All measures are normalized by
dividing by the entropy of the case-control status H(C), and thus
give the percentage of predictive information on the phenotype
status. Values recoverable from the panels: (A) $IG_{strict}$ = 2.078%;
(B) I = 4.516%, $IG_{partition}$ = 2.978%, $IG_{strict}$ = 2.449%.


In the present study, we proposed a fast, model-free, and non- 
parametric measure to detect and characterize three-way epi- 
static interactions. It is a natural extension of the pairwise 
synergy measure using information gain by measuring the total 
three-way mutual information and then subtracting three main 
effects and three pairwise synergies. Our approach was shown 
to be able to detect pure three-way epistasis, that is, no observ- 
able effect at either the two-way or one-way level, in both syn- 
thetic and real population-based genetics data. Our study is not 
the first attempt to quantify three-way epistasis using 
information-theoretic measures. We compared our measure
$IG_{strict}$ with a previously published one, $IG_{partition}$. Both measures
exclude main effects. However, $IG_{partition}$ is biased towards
three-way models that possess strong pairwise interactions. Our
$IG_{strict}$ excludes all one-way and two-way effects and is able to
detect pure three-way synergy that is only observable when all 
three attributes are considered together. 

Both three-way synergy measures were applied to a pulmon- 
ary tuberculosis dataset from a West African population. 
Several potentially relevant innate immunity genes, CD209, 
PTX3, TLRs, and VDR, were included in these data to investi- 
gate their role in pulmonary tuberculosis susceptibility. The 
dendritic cell-specific intercellular adhesion molecule-3- 
grabbing non-integrin (DC-SIGN or CD209) is a crucial M 
tuberculosis receptor expressed on the surface of dendritic 
cells, and is involved in the initiation of innate immune 
response through identification of potential infectious agents. 
CD209 has previously been found to be associated with tuber- 
culosis. 33 34 Long pentraxin 3 (PTX3) is produced by innate 
immunity cells and vascular cells in response to proinflamma- 
tory signals and TLR engagement. 35 PTX3 levels have been 
shown to be correlated to the degree of infection in lungs, and 
PTX3 plasma levels can be monitored to measure treatment 
efficacy because PTX3 concentration decreases as an infection 
is mitigated. 36 TLRs are a family of receptors that are a key
component of the innate immune system. TLRs recognize
pathogenic molecules and control the host immune response
against them, and have been extensively shown to affect
susceptibility to infectious and inflammatory diseases. 37 38 The
VDR has been shown to be linked to TLRs. It was found that
TLR activation of human macrophages upregulated expression
of the VDR and the vitamin D-1-hydroxylase genes. 39 This link
suggests that differences among human populations in the ability
to produce vitamin D may contribute to susceptibility to
microbial infection. Furthermore, a case-control and family
study reported an association between VDR polymorphisms
and susceptibility to tuberculosis. 40

The strongest three-way epistasis models we found (figure 5) 
could further extend and strengthen the understanding of how 
those relevant genes might work in a synergistic way. Although 
the synergistic effects were captured in a statistical manner, they 
may indicate either direct or indirect biological relationships 
among those genetic factors. In particular, the strongest $IG_{partition}$
model shows a 'nested' epistasis hierarchy with both strong
pairwise interactions and a strong pure three-way interaction. These
three SNPs are from VDR and CD209, which indicates a potential
correlation between these two genes. More interestingly, the
strongest $IG_{strict}$ model shows a three-way synergy among TLR9,
PTX3, and CD209. There have been studies reporting correlations
and molecular interactions between TLRs and CD209. 41
However, no published research has indicated a three-way
biological synergy among all three genes. Our findings may suggest
that multiple pathways interweave in the innate immune system
to defend the human body against M tuberculosis, and that
deficiencies in all three genetic factors greatly increase the risk
of developing tuberculosis. We believe that with further biological
validation, our findings could be very helpful in predicting
high-risk M tuberculosis-infected individuals and preventing them
from developing clinical tuberculosis.

Investigating high-order genetic interactions is an arduous task, 
and yet essential for understanding the complex genetic architec- 
ture of human diseases. The effectiveness of our information-gain 
approach in detecting three-way interactions was verified using 
both synthetic and real genetics data. Note that our measurement 
is scalable regardless of the size of genetics data. However, when 
large-scale genome-wide data are considered and exhaustive enu- 
meration of all possible three-way genetic attribute combinations is 
infeasible, data pre-screening using intelligent data-mining
techniques or biological knowledge will be required. There are some
interesting avenues for extending our approach in future studies.
First, it is important to study the genetic interactions on
continuous traits. In that case, the probability density functions
for continuous random variables will be used to replace the
probability mass function in current discrete information-theoretic
measures. Second, it will be interesting to extend the measures to
synergies of orders higher than three. However, as more attributes
are involved, the interaction hierarchy gets more complicated, and
more carefully designed mathematical measures are required. We
anticipate that designing powerful and efficient methods to
quantify high-order epistasis has great potential to improve
disease treatment and healthcare by revealing the genetic
complexity of common human diseases.

Correction notice This paper has been corrected since it was published Online 